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Q^; Abstract 

d. 

• ^ . Rate variation among the sites of a molecular sequence is commonly found in appli- 

qh| cations of phylogenetic inference. Several approaches exist to account for this feature 

but they do not usually enable us to pinpoint the sites that evolve under one or another 

K^ ' rate of evolution in a straightforward manner. In this paper we concentrate on phyloge- 

OO ' 
cn . netic mixture models as tools for site classification. Our method does not rely on prior 

'^ I knowledge of site membership to classes or even the number of classes. Furthermore, 

f~^ ■ it does not require correlated sites to be next to one another in the sequence align- 

ment, unlike some phylogenetic hidden Markov or change-point models. We present 
a simulation study to show that our approach is able to correctly classify the sites to 
evolutionary classes and we analyse the popular alignment of the mitochondrial DNA 
of primates. In both examples, all mixtures outperform commonly-used models of 
among-site rate variation and models that do not account for rate heterogeneity. Our 
method for site classification is directly relevant to the profiling of genes with unknown 
function, and its application may lead to the discovery of partitions not otherwise 
recognised in the alignment. In addition, we discuss computational aspects including 
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the use of simple Markov chain Monte Carlo (MCMC) moves to estimate phylogenetic 
models and argue that these move types can mix efficiently without tempering the 
MCMC chains. 

Key words: among-site rate variation; allocation variable; Bayesian mixture model; 
Markov chain Monte Carlo; phylogeny 

1 Introduction 

Molecular phylogenetics is the inference and interpretation of evolutionary relations between 
taxa, typically different species or strains of viruses or bacteria, based on the taxa's DNA 
or protein sequences. The sequences are aligned on top of each other to form an alignment 
with as many rows as sequences observed, and roughly as many columns (or sites) as char- 
acters in the sequences. The conventional likelihood-based model for phylogeny inference 
(e.g. Felsenstein, 1981) contains three parameters of inferential interest: a tree graph that 
represents the evolutionary relations between the taxa; the branch lengths of this tree that 
measure the expected number of substitutions per site; and a stochastic process that models 
the evolution of the sequences along the branches of the tree (the latter is usually referred to 
as the evolutionary model). Such a model is complex but may still be too simple to capture 
important features of the generating process. In particular, it is not uncommon that sites 
under different functional constraints accumulate substitutions at different rates. It is now 
well understood that if rate variation among sites is present and not accounted for by the 
model, spurious parameter estimates can be produced (Huelsenbeck and Suchard, 2007 and 
references therein). 

Various approaches have been proposed to account for among-site rate variation in phy- 
logenetic inference, including the gamma model (Yang, 1993; 1994) and several more recent 
models involving finite mixtures of distributions (e.g. Pagel and Meade, 2004; Lartillot and 
Philippe, 2004; Huelsenbeck and Suchard, 2007; Webb, Hancock and Holmes, 2009). The 
latter type of models assume that a site is generated from a mixture of multiple processes, 
each of which may be indexed by a specific tree topology, a specific set of branch lengths 
and specific parameters of the stochastic evolutionary model. 

Rate variation among sites may be related to quantitative differences in the rates of 



substitution (e.g. sites with high rates versus sites with low rates) but also to qualitative 
differences in the pattern of substitution (e.g. sites with large transition/transversion rate 
ratios versus sites for which all substitution types occur at the same rate; Pagel and Meade, 
2004). In phylogenetic applications it is possible to find quantitative among-site rate varia- 
tion, qualitative variation, both or neither. Developments in phylogenetic mixture modelling 
have accounted for both types of rate variation and examples of this include Felsenstein and 
Churchill's approach (1996). They account for quantitative variation in substitution rates 
among sites by a hidden Markov process that operates along the alignment assigning rates to 
sites from a finite pool of values. This method incorporates the biologically realistic assump- 
tion of correlation between the rates of evolution at consecutive sites, so that the chance 
of neighbouring sites evolving under the same rate is higher than that of distant sites. A 
disadvantage of this assumption, however, is that possible biases may be introduced by the 
removal of sites involving gaps in the alignment, or by other errors that result in consecutive 
observable sites not being direct neighbours in reality. 

To model qualitative rate heterogeneity, Pagel and Meade (2004) use a Bayesian mixture 
of multiple stochastic evolutionary processes. Their model supposes that data at a given site 
arise from a mixture of multiple classes, each class indexed by a common-to-all-class tree 
and branch lengths, and a class-specific stochastic evolutionary process. The assumption of 
a common set of branch lengths across mixture components results in a phylogeny whose 
branches are a compromise over the possibly quite different substitution rates in the align- 
ment. This may miss important substitutional heterogeneity and so Pagel and Meade (2008) 
and Meade and Pagel (2008) consider extensions to their original model, this time allowing 
for multiple sets of branch lengths. 

Kolaczkowski and Thornton (2008) present a mixture similar to that of Meade and Pagel 
(2008), but conduct inference within a maximum-likelihood framework. 

A related approach, called the CAT model (Lartillot and Philippe, 2004), considers qual- 
itative mixtures of stochastic evolutionary processes which, for simplicity, all have the same 
set of substitution rates but different stationary probabilities. Inference on the number of 
classes is conducted using a Dirichlet process prior and the model is estimated via MCMC. In 
addition to moving in the space of usual phylogenetic parameters (e.g. tree topology, branch 
lengths), the MCMC sampler developed by Lartillot and Philippe also moves in the space of 



number of classes, jumping between mixtures of different dimensions as the run proceeds. 

Huelsenbeck and Suchard (2007) consider quantitative mixtures of branch lengths in 
which sites are partitioned into classes according to a Dirichlet process prior. Sites that are 
assigned to the same class share a common set of branch lengths, while all sites, irrespective 
of their class, share a common tree and stochastic evolutionary model. Both the number of 
classes and the assignment of sites to classes are treated as random variables and, together 
with the usual phylogenetic parameters, are objects of inferential interest. 

One aspect of mixture models that has been under-explored in the phylogenetics literature 
is their use for site classification through the introduction of latent allocation variables. The 
allocation variables identify the underlying class of a site and thus enable us to decompose 
the complicated structure of a mixture into simpler structures. In a phylogenetic context, 
mixture components may have a direct biological interpretation and site classification can 
lead to insights of structure and heterogeneity in the alignment that are not otherwise easily 
uncovered. 

The purpose of this study was, therefore, to extend the functionality of phylogenetic mix- 
ture models to include allocation variables and investigate their use for site classification. 
Lartillot and Philippe (2004) and Huelsenbeck and Suchard (2007) incorporate allocation 
variables, but straightforward statements about site classification are obscured by the esti- 
mation of the number of mixture components. Their MCMC samplers move in a space of 
mixtures of different dimensions, and so sites get allocated to an ever-varying number of 
components as the MCMC run progresses. The mixture in Pagel and Meade (2004) does not 
incorporate allocation variables and site classification therefore involves a posteriori process- 
ing of their analysis output. 

In our study we do not attempt inference on the number of mixture components but, 
instead, consider mixtures with fixed dimensionality. We develop a heuristic technique, 
based on Bayes factors and inspection of the analysis output, to select the number of mixture 
components and the type of mixture that best fits the data. We demonstrate model selection 
and site classification through applications to both synthetic and real DNA data. The results 
suggest that our method is able to detect heterogeneity in these data and classify the sites 
to evolutionary components with high accuracy. 

With regard to model estimation, we consider commonly-used MCMC move types that 



update tree topology and branch lengths en bloc and argue that this may have detrimental 
effects on the estimation performance of the sampler. We present an alternative set of 
move types to update the parameters of a mixture or non-mixture phylogenetic model, and 
investigate their performance. We show that our algorithm achieves the same, or greater, 
efficiency than existing methods with potential for a reduction in computational cost. 

2 Bayesian phylogenetic mixtures for site classification 

2.1 The models 

The backbone of likelihood-based phylogenetic methods is a homogeneous model that posits 
that the characters at a site in a DNA alignment are an independent realisation of a 
continuous-time Markov process, with state space X = {A,C,G,T}, that evolves on the 
branches of a bifurcating tree topology, 0, and has realisations at the leaves of this tree. 
The instantaneous rate matrix, Q, that generates the Markov process is indexed by a (pos- 
sibly vector) parameter 6. There are several proposed parametrisations of the Q-matrix in 
the literature (e.g. Jukes and Cantor, 1969; Hasegawa et al., 1985) with the most general 
time-reversible one called the GTR matrix (Lanave et al., 1984; Tavare, 1986), where 
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and Q = (r, it) is a collection of six substitution rates r = [tag, • • • ? i^gt) and four stationary 
probabilities vr = (vr^, . . . , tt^) with constraints r^, ttj > 1 and "^Tm = ^'n'i = ^ {m = 
AC, . . . , GT; i = A, . . . , T).The off-diagonal values of Q are non-negative while the diagonal 
entries are defined so that each row adds up to zero. This matrix is usually standardised so 
that Xliei ^fe^i ~ -'- (Felsenstein, 1981) and, as a result, the branch lengths are a measure 
of the expected number of substitutions per site in the molecular sequence alignment. The 
Markov process of character substitution is time-reversible, a feature that prevents us from 
inferring rooted trees. Thus, for an observed alignment of size S sequences x A^ sites, 
parameter takes values in the set of unrooted bifurcating leaf-labelled trees for S taxa; 



branch lengths are real valued; and the space in which parameter 9 takes values is dictated 
by the chosen parametrisation of matrix Q. The objective of the analysis is usually inference 
about the tree topology, 0, this tree's branch lengths (denoted by a set t = {ti, . . . ,^25-^3}) 
and 6. 

Building upon the homogeneous model, we account for among-site rate variation using a 
finite mixture of distributions of the type 

k 

Xn I w, 0, t,9,k ^ y^ Uj p{xn \ 0, tj, 9j), independently for n = 1, . . . , iV, (2) 

i=i 
where Xn is the character vector at site n; k is the number of mixture components; u = 

(wi, . . . ,ijJk) are the mixture proportions {uj > and J2j=i^j — 1); ^^ch component j (j = 

1, . . . ,k) has set of branch lengths tj and parameters of the Q- matrix 6j collectively denoted 

by t = (ti,. . .,tfc) and 6 = {6i, . . .,6k); and p(x„ | 0,ti,6'i),. . . ,p(x„ | (j),tk,6k) are the k 

component likelihoods. Model (|2]) thus asserts that characters at site n are generated from 

a mixture of k different evolutionary components occurring in proportions Ui, . . . ,Uk. To 

decompose the structure of this mixture, a set of latent allocation variables, z = {zi, . . . , zn), 

is introduced where each z„ G {1, . . . , fc} is such that 

Xn\ (l),t,6,k,Zn ^ p{xn\4>,tz„,6zj, independently fom = 1, ..., A^. (3) 

This formulation not only accounts for both quantitative and qualitative rate heterogene- 
ity, but also provides a means to class discovery by the use of z. In addition to classifying 
the sites to evolutionary components, mixture (|2]) also enables us to discern the profiles of 
each class by estimating the component-specific parameters. So, the analysis may lead to 
statements such as 'class 1 is more conserved than class 2 as the former displays a shorter 
total branch length than the latter' or 'the nucleotide composition of the two classes is quite 
different, as reflected by the estimated stationary probabilities'. 

The joint prior of all parameters can be expressed as 

p{u!, Z, 0, 9, t) = p{ljj)p{z I Uj)p{(f) I Z, Uj)p{6 I 0, Z, Uj)p{t I 6, 0, Z, (jj) 

= p{uj)p{z I uj)p{4>)p{t)p{e) (4) 



6 



where we have suppressed the exphcit conditioning on k because we consider only mixtures 
with a fixed number of components and make independence assumptions between all pa- 
rameters other than z and u. In our study, the prior for u is taken to be the symmetric 
Dirichlet distribution u ~ Dir^lp, ■ ■ ■ , p) (and we generally set p = 3). Conditional on u, 
the allocations zi, . . . ,ziy are assumed independent and identically distributed 

Pr{zn=j\uj)=ujj, j = l,...,k. (5) 

We make the following standard choices for the priors on phylogenetic parameters. All 
tree topologies are assumed to be equally likely a priori; that is, we take a discrete uniform 
prior for 0. The prior distribution for branch lengths makes an assumption that the 25* — 3 
branches for each of the k components behave independently both within components and 
across components. Exponential priors on individual branch lengths are specified, with 
exponential-rate parameter /3 (and we set /3 = 10 in line with similar published models) so 
that E(t/i j) = 1//3 for branch length h in the jth mixture component {h = 1, . . . ,2S — 3; 
j = 1, . . . ,k). For the parameter vectors 6 of the k instantaneous rate matrices, we assume 
independent prior distributions on each r^ and tcj of the form r^ ~ DirQ{l, . . . , 1) and tTj ~ 
Dtr,{l,...,l). 

Throughout, the model specified by ([2]) and ([3]) is referred to as the Q + t mixture model. 

In our examples, we will also consider nested submodels of the Q + t mixture. Firstly, 
we consider mixtures of multiple Q matrices which share a common set of branch lengths, t, 
and tree topology, 0, (Pagel and Meade, 2004): 

k 

Xn I cj, 0, t,9,k r^ 2_, ^j Pi^n I 0! t, Oj), independently for n = 1, . . . , A^. (6) 

i=i 
Restricting this further we also consider mixtures of branch lengths and Q matrices, 

but where the Q matrices across components share the same stationary probabilities, i.e. 

Oi = (ri,7r),...,6'fc = (rfc,7r): 

k 

Xn I w, 0, t,6,k r^ y^ Uj p{xn \ 0, tj, rj, it), independently for n = 1, . . . , iV. (7) 

i=i 
Both models can be augmented with allocation variables. We refer to model iQ and its 

corresponding augmented formulation as the Q mixture, and to model ([7]) and its augmented 
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version as the r + t mixture. 



2.2 Likelihood computation 



The hkehhood function under the most general Q + 1 mixture is the product of the distri- 
butions at individual sites (equation ([3])), from site 1 to N: 



N 



,t,e\x,z) = Ylp{xn\(j),t,^,e,j. 



n=l 

We assume that substitutions at different branches of the tree and among different sites 
in the alignment are independent of one another. Likelihood (|H]) is usually computed for a 
specific tree and so each tree topology requires a reformulation of this function according to its 
corresponding branching structure; the larger the tree the more computationally prohibitive 
the calculation. A recursive technique for the efficient computation of phylogenetic likelihood 
functions, called the pruning algorithm, was introduced by Felsenstein (1981), and this is 
the algorithm that we use. 

2.3 Model fitting 

Markov chain Monte Carlo (MCMC) will be required to fit models of this complexity and we 
present the basic move types in our MCMC sampler. A distinctive feature of our method is 
that changes to the topology are separated from those in branch lengths. This is particularly 
important for some of the mixtures, where the components share a common topology but 
have different sets of branch lengths. The set of move types that we use is: 

(a) updating the tree topology 0; 

(b) updating branch lengths t; 

(c) updating the substitution rates r; 

(d) updating the stationary probabilities vr; 

(e) updating the mixture proportions u; 

(f) updating allocation Zn. 

One complete pass over these six moves is an iteration, the basic time step of our algorithm. 
The first two move types focus on the tree while the next two concentrate on the parameters 



of the models on the tree; the last two move types concern the mixture allocations and 
proportions. We now consider the three groups separately in the context of the most general 
Q + t mixture model, before investigating their performance later in the paper. 

2.3.1 Updating the tree topology and branch lengths 

The tree topology is updated via the nearest neighbour interchange (NNI) (Robinson, 1971; 
Moore, Goodman and Barnabas, 1973), in which one of the two nearest neighbours of the 
current topology (in NNI space) is proposed with equal probability. NNI generates a can- 
didate topology while preserving the current set of branch lengths. A separate proposal 
mechanism is then used to update branch lengths while maintaining the same topology. We 
consider two different proposals for branch lengths: 

• Branch length multiplier (BLM). Also known as proportional shrinking and expanding 
(Yang, 2006), this proposal updates the length of a randomly chosen branch thj by 
multiplying it by a quantity generated from the density 

f{m) = {Xm)-\ l/6<m<6 (9) 

where A = 2 log S and S > 1 acts as a tuning parameter. 

• Branch length normal additive (BLNA). Also known as the sliding window proposal 
(Huelsenbeck and Ronquist, 2001), this mechanism updates a randomly chosen branch 
length th,j via an additive Gaussian perturbation, t'^ ■ ~ N{th,j, o"^), so that a^ acts as 
the tuning parameter. If negative branch lengths are proposed, they are reflected at 
zero with the proposal still remaining symmetric. 

BLM may be thought of as self-tuning as the variance of the proposed branch length 
is proportional to the square of the original length. This works well when exploring large 
branches but can be a bit sticky when branch lengths are small as it can take a large number of 
iterations to move a short distance. On the other hand, a candidate branch length generated 
from the BLNA proposal has a step size which depends only on the tuning parameter o"^ and 
not on the current branch length. This makes it hard for BLNA to work equally effectively 
at both large and small scales. In experiments, we achieved best performance by alternating 
between BLM and a BLNA tuned for small branch lengths (Supplementary Material I). 
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The acceptance probability of a branch length proposed from either BLM or BLNA is 

u ,> , ■ Ji P(4,j) L{(l),t',9\x,z) q{tl^,th,j) \ 

C((th j, ti, ,) = mm < I, — — ^ ^r^- Ti r 7 — - — ;-^ > 10 

'' '' \ Pith,j) L{^,t,9\x,z) q{th,jA,j)\ 

which simplifies to rne~^*'^'^~^^'^' for the BLM proposal and to e^ '-^'^-j^*'"'^'' for BLNA, times 
the likelihood ratio in both cases. 

2.3.2 Updating the Markov process parameters 

The jth component of the Q + t mixture has a set of parameters controlling the substitution 
rates plus a set of stationary probabilities, tacji ■ ■ ■ ^''^gtj and t^aj, ■ ■ ■ ,'^t,j, respectively. 
Since we can treat each mixture component separately for updating purposes, we drop the 
subscript j. Both types of parameters are constrained to sum to one and, as they utilise the 
same type of proposal, here we concentrate on the substitution rates. 

We suggest generating a new set of substitution rates, r', from a Dirichlet distribution 
centred at the current rate values with a positive shift e > and with tuning parameter 
a > 0; i.e. r' ~ Dir^la {tac + e), . . . ,a {tgt + e))- The variance of the mth element of a rate 
vector proposed with this move type, henceforth referred to as the eDirichlet proposal, is: 

varirj = ^ -— 11 

a5(ao + 1) 

where ao = J2m=Ac '^(''^™- + e) = « (1 + 6e). If e = (Larget and Simon, 1999), when r^ is 

close to zero so too is var{r'^). This can create a cycle in which the MCMC sampler keeps 

proposing candidate rates very close to zero because the step size of the proposal is nearly 

zero, typically needing many iterations to escape. We further investigate this behaviour in 

a later section. 

2.3.3 Updating the mixture parameters 

Updating the allocation variables and the vector of mixture proportions is a fairly standard 
problem in mixture modelling. The usual approach is to update the allocations one at a time 
using either a uniform Metropolis or a Gibbs proposal (the number of mixture components 
k tends to be small in our applications so this is feasible). The mixture proportions are 
usually updated using a Gibbs sampler since their posterior conditional is easily seen to be 
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a Dirichlet distribution with parameters p + iVi, . . . ,p + A^^, where Nj = X]n=i ^[^n = j] 
is the number of sites allocated to component j and /[■] is the indicator function. A well 
known difficulty of this combination of proposals is that they may mix badly when one 
or more components become quite small or when the other parameters characterising the 
components make it hard for a site to swap components (see Leslie, 2007 or Hurn et al., 2008 
for examples in quite different application areas). In the latter case, Leslie (2007) and Hurn 
et al. (2008) both suggest a joint updating strategy. However here we are primarily worried 
about the former case. Given our experience in updating r and tt, we again propose using a 
shifted Dirichlet approach, here replacing the Gibbs draw from a Dir^lp + Ni, . . . , p + N^) 
by a Metropolis-Hastings proposal, u' ~ Dirk{p + A?"! + e, . . . , p + A^^ + e) with e > 0. The 
acceptance probability of this move type simplifies to 

a{uj,uj') = mm\l,\[{uj,/uj'^Y\ (12) 

and so a high acceptance rate is maintained for small values of e. 

2.4 Model selection 

Turning to the decision of choosing which model to use for a particular set of data, Bayes 
factors can be used to summarise the evidence provided by the data in favour of one model 
relative to another (Kass and Raftery, 1995). When each model is equally likely a priori, 
the Bayes factor is defined as the ratio of the marginal likelihood under model Mi to the 
marginal likelihood under a second model, Mq, given the data, x. The marginal likelihood for 
model Mi is the expectation under the prior of the likelihood of the data x, all conditioned 
on the model M^ (or, equivalently, the integral over the parameters of the joint distribution 
of the data and the prior conditioned on the model). 



p{x\Mi)= / p{x\^,,Mi)p{^i\Mi)d^i (13) 

where di is the parameter vector of model Mj. Bayes factors are usually interpreted on 
the log scale using the rule of thumb that 2 \n{BF) > 10 indicates very strong evidence in 
favour of one of the models, < 2 ln.{BF) < 2 indicates no significant difference between 
the models, and with a range of levels in between according to a scale provided in Kass 

11 



and Raftery (1995). In examining more that two models, the values of the log marginal 
likelihoods may be compared directly rather than forming multiple Bayes factors. 

The integral in Equation ( IT3l) is typically intractable, except for the most elementary 
phylogenetic applications, and its estimation is the topic of considerable interest. A com- 
monly used estimator is the harmonic mean estimator of Newton and Raftery (1994). This is 
a form of importance sampling estimator, taking the posterior as its importance distribution 
which means it can be calculated from the MCMC chain used for fitting the model at little 
extra cost. The harmonic mean estimator is known to be relatively unstable although stud- 
ies suggest that it is accurate enough for interpretation on the log scale (Kass and Raftery, 
1995 and references therein). A recent paper by Xie et al., (2011) details its shortcomings, 
strongly suggesting instead a much more computationally expensive estimation approach. 
Given the already computationally intensive nature of our model fitting, we have chosen to 
work with the harmonic mean estimator since we are interested in log marginal likelihoods. 
However we do bear in mind its potential instability when choosing run lengths. 

3 Classification of simulated data 

To demonstrate the key features of our approach we generated a synthetic DNA alignment 
of size 16 sequences x 2 500 sites, with the software package Seq-Gen (Rambaut and Grassly, 
1997 ). Sites 1 — 1500 were generated from an evolutionary class with substitution rates 
n = {tac = 0.0500, Tag = 0.4000, tat = 0.0500, tcg = 0.0500, tct = 0.4000, tgt = 
0.0500}, stationary probabilities tti = {tta = 0.3220, tcc = 0.3040, ttg = 0.1080, ttt = 
0.2660} and total branch length Ti = 10. Sites 1501 — 2500 were simulated with r2 = 
{vac = 0.1009, TAG = 0.3645, tat = 0.1506, rcc = 0.0639, rcr = 0.3044, tgt = 0.0157}; 
7r2 = {tta = ■ ■ ■ = t^t = 0.2500} and T2 = 0.1. We arbitrarily labelled the former as class 
1 and the latter as class 2. Both classes were generated under the same tree topology and 
this was randomly sampled from the space of all unrooted bifurcating trees that relate 16 
sequences. In our experiments, the topology was held fixed at the generating topology. 

The intention here is to assess whether the classification method is able to detect the 
substitutional differences between the two classes and correctly allocate sites to evolutionary 
groups without prior knowledge of the partitioning. 
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Before the runs for inference, we conducted several exploratory runs to tune the MCMC 
proposals. We estimated each model at least twice and verified that each of these independent 
runs converged to the same region of the parameter space. Unless otherwise stated, the runs 
for inference comprised 60 000 iterations of our MCMC sampler, and we discarded the first 
quarter as burn-in. 

Figured] shows the estimated log marginal likelihoods for a range of mixture models; r + t, 
Q and Q + 1 with varying number of components. The log-likelihood for A; = 1 is common 
across all mixture types and it corresponds to a fit of the data with the homogeneous model; 
it is clear that any mixture performs better than the homogeneous model. For comparison, 
we also fitted a model in which the rates of substitution are allowed to follow a gamma 
distribution with four discrete categories (Yang, 1993; 1994). The log marginal likelihood of 
the discrete-gamma model, as this method is usually known, was estimated in —29 581. All 
two-component mixtures perform better than the discrete-gamma model which suggests that 
the substitutional heterogeneity in the data can only be adequately explained by multiple 
sets of branch lengths and more than one set of rate parameters. As expected, the log 
marginal likelihood reaches a plateau with a two-component Q + t mixture and continues 
to show improvements with some three- and four- component mixtures. A typical output 
from Q and Q + t mixtures with k > 2 displays partitions of the sites that are based on 
nucleotide composition (e.g. a component rich in one nucleotide type with the corresponding 
complementary component that is poor in that nucleotide), and which may be driving the rise 
in log-likelihood. Although the partition of sites by character composition is reasonable, we 
are interested in discovering class structure beyond the obvious differences. But perhaps more 
importantly, substitution rates cannot be reliably estimated for components that are poor in 
one or more nucleotide types. A fit with a r+t mixture, on the other hand, shows a negligible 
increase in log likelihood as the number of components augments. Indeed, r + t mixtures of 
sizes three and four show nearly empty components with very low weights. This suggests 
that, whenever the partition of the sites is not based on character composition, as in an r + t 
fit, two components are sufficient to describe the data. At two components, there is very 
strong evidence for a Q + 1 mixture {2ln{BFQ^tvsQ) = 1462 and 2ln{B Fq^tusr+t) = 1 196), 
and this is the model we use to discuss site classification. 

MCMC proposals for the classification run were tuned to values 6 = 1.5 for the BLM 
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Figure 1: Estimated log marginal likelihoods for the models fitted to the synthetic DNA 
alignment. MCMC runs for mixtures with four components comprised 20 000 iterations, 
with the first quarter discarded as burn-in. Plotted data: log-likelihood for a homogeneous 
model: -32 219; Q + t mixture with 2-4 components: -27368, -27134, -26 467; Q mixture 
with 2-4 components: —28 099, —27 563, —26 530; r + t mixture with 2-4 components: 
—27966, —27908, —27869; discrete-gamma model with 4 categories: —29 581. 

move; a = 0.08 for BLNA; ar = 900 and Ott = 700 for the eDirichlet proposal for substitution 
rates and stationary probabilities, respectively; eg = 0.0001 to correct the eDirichlet move 
and e^ = 0.0001 to correct the proposal for mixture proportions. 

Figure [2] shows the estimated posterior probabilities of classification to the first com- 
ponent ([X] details how these probabilities were obtained). High probabilities coincide with 
sites originally generated from class 1 while low probabilities match the positions generated 
from the second class. The crossover at which sites were simulated from different evolution- 
ary classes is strikingly well recovered by the method. Moreover, ergodic averages for the 
remaining parameters in the mixture coincide favourably with the generating values. 

4 Classification of mitochondrial DNA 

In a second application, we analysed an alignment of mitochondrial DNA (mtDNA) se- 
quences from the primate species human; gorilla; chimpanzee; orangutan; gibbon; crab- 
eating macaque; common squirrel monkey; Philippine tarsier and ring-tailed lemur (Brown 
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Figure 2: Estimated posterior probabilities of classification for the synthetic DNA alignment. 
High and low probabilities clearly demarcate the two classes used in generating the data. 

et al., 1982; Hayasaka et al., 1988). This alignment, of size 9 sequences x 888 sites, comprises 
the portions of two protein-coding genes (sites 1 — 694) and a transfer RNA (tRNA) region 
(sites 695 — 888). Transfer RNA is a highly conserved molecule in charge of translating the 
information encoded by coding genes into the protein alphabet. Such a translation process is 
achieved by mapping each set of three consecutive, non-overlapping DNA characters within 
a coding region into one amino acid. A coding DNA triplet is called a codon, and the second 
position {cp2) of a codon is known to undergo substitutions at slower rates than the first 
(cpl) and third codon positions {cp3; Fitch and Markowitz, 1970). This difference in sub- 
stitution rates relates to the fact that a change at the third codon position does not always 
affect the resulting protein but a change at cp2 may, more likely, alter the final product and 
result in a mutation. 

In this analysis, we are interested in detecting the evolutionary heterogeneity that exists 
between the different codon positions and the tRNA region. We also want to estimate the 
component-specific parameters in order to understand the features of the model that best 
discriminate between classes. 

The primate mtDNA alignment has been analysed extensively using phylogenetic meth- 
ods (e.g. Yang, 1995; Larget and Simon, 1999; Suchard et al., 2001), in most cases assuming 
four evolutionary classes (corresponding to the three codon positions plus the tRNA region). 
Most of these previous approaches have relied on prior knowledge about site membership 
which may be restrictive and error prone. For instance, in a study by Yang (1995), some 
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sites within the tRNA region were a priori misclassified resulting in inaccurate parameter 
estimates, as stated in the mtprim9.nuc file of the software package PAML4 (Yang, 2007). 

We fitted Q + t and Q mixtures with different number of components to the primate 
mtDNA alignment. Figure [3] shows the log marginal likelihoods estimated for each model 
from 45 000 iterations, following a burn-in period of 15 000 iterations. It is clear that the 
data contain heterogeneity that is not fully accounted for by either the homogeneous or the 
discrete-gamma models. A Q + t mixture with two components improves upon the homo- 
geneous model by 576 log-units and upon the discrete-gamma model by 414 log-units. A Q 
mixture with two components improves upon the homogeneous and discrete-gamma models 
even more, and greater improvements are observed as the number of mixture components 
augments for these two types of models. At first sight, the results point to the Q mixture 
with four components as the best model. However, a closer look at the analysis output 
reveals that the partition of sites into mixture components is, in some cases, mainly driven 
by the nucleotide composition of the sites. Figure H] shows the ergodic averages of stationary 
probabilities obtained from the analyses of the data with Q and Q + t mixtures with one 
to four components. All mixtures other than the Q + t with two components, contain a 
component that is poor in A nucleotides. 

To investigate further the partition of sites by nucleotide composition, we fitted the 
primate mtDNA alignment with a. r + t mixture. The log marginal likelihoods for a mixture 
of this type are shown in Figure [31 When the mixture components are not indexed by 
component-specific vrs (in contrast with both the Q and Q + 1 mixtures), the classification 
process is unable to separate the sites by their character content and the increase in log 
likelihood is more gradual. In fact, r + t mixtures of sizes three and four show poorly 
estimated components with very low weights which suggests that two components are enough 
to account for the rate variation in this alignment. Collectively, these results indicate that 
the large increase in log likelihoods ioT Q + t and Q mixtures of sizes three and four, relative 
to those with two components, is mainly due to uninteresting partitions of the sites by 
nucleotide composition. Figure H] shows that this is also the case for an analysis with a 
two-component Q mixture. We therefore decide on the Q + t mixture with two components 
to discuss site classification. 

MCMC proposals for the classification run were tuned from several exploratory runs to 
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Figure 3: Estimated log marginal likelihoods for the models fitted to the primate mtDNA 
alignment. The MCMC chain for mixtures with four components comprised 15 000 samples 
after discarding the initial 5 000 as burn-in. Plotted data: log-likelihood for a homogeneous 
model: —5 201; Q + t mixture with 2-4 components: —4625, —4412, —4 300; Q mixture 
with 2-4 components: —4 564, —4404, —4243; r + t mixture with 2-4 components: —4 710, 
—4 620, —4 557; discrete-gamma model with 4 categories: —5 039. 
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Figure 4: Ergodic averages of stationary probabilities from a fit of the primate mtDNA 
alignment with (a) a homogeneous model; (b) a Q mixture with two to four components; 
and (c) a Q + t mixture with two to four components. 
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values S = 1.5 for the BLM move; a = 0.06 for BLNA; a, = 800 and a^ = 600 for the 
eDirichlet proposal for substitution rates and stationary probabilities, respectively; ee = 
0.0001 to correct the eDirichlet move and e^j = 0.0001 to correct the proposal for mixture 
proportions. 

Table [1] reports the ergodic averages of model parameters and the estimated integrated 
autocorrelation times, f (|B] gives more details about r). The ergodic averages for the rates 
of substitution agree with the bias that favours transitions (a substitution from A — ;■ G 
or C — )■ T) over transversions (any other substitution). The two components are well- 
differentiated by their branch lengths; the second class evolves under a tree with a total 
branch length around ten times longer than that of the first component. Branch lengths are 
the quantities that best discriminate between components and so the inclusion of multiple 
sets of branch lengths seems to be reasonable. 

Figure [5] shows the estimated posterior classification probabilities of sites belonging to 
the second component (labelled as j =2 in Table [1]). For ease of visual interpretation, 
we have rearranged the protein-coding genes according to codon positions; sites 1 — 232 
correspond to cpl, sites 233 — 463 to cp2 and sites 464 — 694 to cp3, but there is nothing 
in the formulation of the classification method that requires such a rearrangement. Sites 
with high posterior probabilities in Figure |5] mostly occur within the cp3 and cpl regions 
and, previously, we found that the second component has substantially larger branch lengths 
than component 1. This profile agrees with the theory; the cpl and cp3 regions accumulate 
substitutions at higher rates than the more conserved cp2 and tRNA, which is captured by 
both the classification probabilities and the parameter estimates. 

The consensus tree topology, obtained as the 50% majority-rule, is shown in Figure |6] and 
it matches the published topologies in Yang (1995), Larget and Simon (1999) and Suchard 
et al. (2001). 

5 Performance of the MCMC moves 

We first compare our separated topology /branch length moves with the combined topology 
and branch length LOCAL proposal (Larget and Simon, 1999), which is implemented in the 
phylogenetic software package MrBayes (Huelsenbeck and Ronquist, 2001) and commonly 
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Figure 5: Estimated posterior classification probabilities for the primate mtDNA alignment, 
from an analysis with a two-component Q + t mixture. High classification probabilities occur 
at the cpl and cp3 regions, which are known to undergo substitutions at higher rates than 
the highly-conserved cp2 and tRNA segments. The effect of horizontal lines is due to the 
presence of sites with exactly the same character composition. 
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Figure 6: The 50% majority-rule consensus topology obtained from the chain of sampled 
topologies during the analysis of the primate mtDNA alignment with a two-component Q + t 
mixture. The posterior support for clades is shown on the branches. 
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Table 1: Ergodic averages of model parameters and estimated integrated autocorrelation 
times, r, from an analysis of the primate mtDNA alignment with a two-component Q + t 
mixture. Here j indexes the mixture components and the notation ^ intj and ^ extj refers 
to the total length of interior and exterior branches for the jth component. 

used for phylogenetic inference (e.g. Lartillot and Philippe, 2004; Nylander et al., 2004). 
Poor performance of LOCAL has been documented before for applications where only a 
few trees are supported by the data (e.g. Lakner et al., 2008) and our simulations agree 
with this observation. We consider a synthetic DNA alignment of 6 sequences x 2 500 
sites generated using the program Seq-Gen (Rambaut and Crassly, 1997) with parameters: 
r = {tac = 0.140, TAG = 0.340, tat = 0.090, tcg = 0.008, tct = 0.420, tgt = 0.002}; 
TT = {tta = ■ ■ ■ = tit = 0.25} and {(f), t) = {{{Taxon2 : 0.16, Taa;on4 : 0.34) : 0.61, Taxon5 : 
0.2) : 0.53, (Taa;on3 : 0.48, Taxon6 : 2.14) : 0.35, Taxonl : 0.05). A homogeneous model 
was fitted with the substitution rates and stationary probabilities held fixed at values very 
close to their generating values. We extended the comparison to include MrBayes' inbuilt 
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implementation of Metropolis-coupled MCMC ((MC)^; Geyer, 1991) which is a technique to 
improve the movement of an MCMC sampler by coupling together a series of successively 
easier-to-sample distributions, but at a correspondingly higher computational cost. We opted 
for MrBayes' default tuning of the LOCAL proposal and used two tempered chains when 
applying (MC)^. 

Figure [7] shows the trace plots and corresponding autocorrelation functions monitoring 
one aspect of the output, the total branch length T (we use the total rather than individual 
lengths as interior branches are not consistently labelled across the different topologies visited 
during an MCMC run). It is clear that Metropolis-coupling the LOCAL proposal improves 
performance (compare Figure [Tl^b) with (d)). It is even clearer, however, that the separate 
topology /branch length moves are the best choice here (see Figure [7|(f)). We can quantify 
the comparison of the methods by estimating the integrated autocorrelation time, r (e.g. 
Green and Han, 1992); r gives the factor by which a particular MCMC algorithm increases 
the variance of ergodic averages over those obtained by independent sampling. The smaller 
r, the better the MCMC algorithm. In this example, the MCMC LOCAL algorithm gives 
f = 115.54, (MC)^ LOCAL has f = 69.21 while our proposed combination of moves has 
f = 12.01. However, any reduction in f should be weighed against the computational costs 
of the different approaches. The MCMC LOCAL took 110 seconds, (MC)^ LOCAL 201 
seconds while our proposed combination of moves took considerably more at 1740 seconds. 
In other words, we can decrease the variance by a factor of 115.54/12.01 = 9.62 by using our 
proposals instead of LOCAL but it will take 1740/110 = 15.8 times longer. This suggests it 
would be better simply to run the less efficient LOCAL for more iterations. However this is 
a hard comparison to make entirely fairly because MrBayes is a commercial package while 
our code is not; a speed-up of 2 in our code would reverse this conclusion. We also note here 
that the Metropolis-coupling of LOCAL decreases the variance by a factor of just 1.67 while 
increasing the time taken by a factor of 1.83. 

One reason for the poor performance of LOCAL is that this proposal generates a new 
set of branch lengths and, as a by-product, it may also generate a new topology as part of 
the same updating step. The LOCAL proposal can make large moves in topology space, but 
when this move is not supported by the posterior distribution both the candidate topology 
and any change in branch lengths will be rejected. Therefore, when the data support only 
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Figure 7: Trace plots of total branch length T, for 15 000 samples after burn-in. (a) MCMC 
with LOCAL proposal; (c) (MC)=' with LOCAL proposal; (e) MCMC with NNI, BLNA, 
BLM updates, (b), (d), (f) Autocorrelation functions for samples of T output by (a), (c) 
and (e), respectively. 
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a few trees, the limited movement of the chain in topology space will restrict the movement 
in branch-length space, causing bad performance of the sampler. 

Turning to the performance of the eDirichlet proposal for updating r and tt, we considered 
the same synthetic alignment as before. To isolate the effects of the eDirichlet proposal, the 
topology and branch lengths were held fixed at their generating values and only vr and r 
were updated. We monitored the rcr chain. 

Figures El^a) and (b) show the traceplot and autocorrelation function, respectively, of the 
post-burn-in samples generated from a MrBayes run effectively implementing the version of 
the eDirichlet proposal with e = (Larget and Simon, 1999). It is clear that there are some 
mixing problems and this is reflected in the estimated integrated autocorrelation f = 474 
(with the run taking 51 seconds). 

As a computational comparison and a demonstration that this is not a rare event, fig- 
ures [H](e) and (f) show the equivalent plots with the same proposal implemented in our own 
code; here r = 137 (taking 354 seconds). It is possible to avoid the poor mixing by deploying 
(MC)^. Figures [8](c) and (d) show the traceplot and autocorrelation function, respectively, 
for MrBayes' coupled version. Coupling with one additional chain improves the mixing in 
this example (f = 46) but at the increased computational cost from the original 51 seconds 
to 92 (although this doubling of time is clearly worthwhile for a reduction of 10 in f). 

Finally, Figures [8]^g) and (h) are plots of r^r updated with the eDirichlet proposal using 
e = 1 X 10~^. The stickiness at zero observed with the non-shifted Dirichlet proposal is 
avoided; here f = 52 at virtually no extra computational cost, 366 seconds as opposed to the 
e = cost of 354 seconds with our code. This demonstrates that our eDirichlet proposal is 
considerably cheaper in terms of computational cost relative to a tempered MCMC solution, 
and that it effectively removes the stickiness of the chain at values close to zero. 

In order to assess the sensitivity of the e-corrected proposal to the choice of e, we per- 
formed a number of runs for a range of values for e (keeping all other tuning parameters 
and initial values unchanged). Table [2] reports the ergodic averages of rate tgt and the es- 
timated integrated autocorrelation times, f, corresponding to 15 000 samples after burn-in. 
The integrated autocorrelation times indicate that the worst-performing samplers are those 
with e = (corresponding to the non-corrected proposal) and e > 0.002 (corresponding to 
a case where e is greater than the generating tct)- The poor performance of the latter is 
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Figure 8: Traceplots oircT, for 15 000 samples after burn-in. (a) MrBayes with Dirichlet pro- 
posal; (c) MrBayes (MC)^ with Dirichlet proposal; (e) authors' code with Dirichlet proposal; 
and (g) authors' code with eDirichlet proposal. (b),(d),(f),(h) Autocorrelation functions of 
sampled rcr, generated with methods (a), (c), (e) and (g), respectively. In all cases, tuning 
parameter a was set to 500. 



24 



e 


rcT 


r{rGT) 





0.0027 


137 


1 X 10-^ 


0.0028 


71 


1 X 10"^ 


0.0029 


70 


1 X 10-^ 


0.0029 


80 


1 X 10-5 


0.0029 


52 


1 X 10-4 


0.0029 


60 


1 X 10"3 


0.0028 


68 


2 X 10-3 


0.0029 


66 


3 X 10-3 


0.0027 


156 



Table 2: Ergodic averages for rate tgt (generating value 0.002) and estimated integrated 
autocorrelation times f for 15 000 samples after burn-in. 

because the proposal mechanism makes sufficiently large steps away from the bulk of the 
posterior support that many are rejected. Between these two extremes, there is considerable 
stability which dispels any potential worries over the estimation of very small rates. Overall, 
this shifted Dirichlet proposal seems a rather efficient approach. 

6 Discussion 

We have presented a classification method for molecular sequence data that employs mixture 
models augmented with allocation variables. Our method contrasts with a common approach 
that assigns different phylogenetic models to individual, a priori-known partitions of the 
data, later combining the partitions into a single composite model (e.g. Nylander et al., 
2004). Partitioning the data prior to analysis makes two key assumptions: (1) the classes 
are known, and (2) there are as many classes as partitions in the alignment, which may 
result in overparametrisation. By contrast, our classification approach via mixture models 
enables us to discover the most appropriate segmentation of sites conditional on the model, 
and to make statements about the most important factors of substitutional heterogeneity in 
the data (e.g. is it the branch lengths that drive the segmentation in the data or the pattern 
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of substitution rates?). 

Our classification metliod is directly relevant to the investigator who has a concatenation 
of multiple genes but not prior information about which of these genes can be appropriately 
described by the same parameters. It may equally be used as a tool for gene profilling: when 
having a set of reference genes with known functions and a query gene with an unknown 
profile, our method can be used to study the relation between the query gene and the refer- 
ence genes. This should allow for statements about evolutionary similarities or dissimilarities 
between the reference and the query genes, leading to a clearer picture of the query gene's 
evolutionary identity. 

Our method accounts for both qualitative and quantitative among-site rate variation by 
considering mixtures with multiple sets of branch lengths and Q matrices. Mixtures with 
multiple sets of branch lengths further account for a phenomenon known as heterotachy 
(Lopez, Casane and Philippe, 2002), in which the rates of evolution along branches leading 
to different taxa in the tree vary across sites. In fact, since the beginning of this research a 
number of groups have independently proposed mixtures of sets of branch lengths as a way 
of modelling heterotachy in phylogenetic studies (eg. Pagel and Meade, 2008). 

An area of further work is the development of a mixture of overall-rates of substitution. 
An overall-rate for the jth component acts as a scaling factor of the branch lengths so 
that different components have trees with proportionally scaled branch lengths. Such a 
formulation would allow for a direct comparison with the popular discrete-gamma model 
(Yang, 1994), in which the overall-rates are assumed to conform to a Gamma distribution 
and all the components are constrained to have equal relative sizes. In contrast, a mixture 
of overall-rates allows more flexibility by letting the data support different relative sizes of 
the components and by avoiding distributional assumptions on the overall-rates; it further 
enables site classification by the inclusion of allocation variables, a feature that the discrete- 
gamma model lacks. However, the advantages of a mixture would have to be weighted 
against the convenient fact that a discrete-gamma formulation incorporates only one extra 
parameter relative to a homogeneous model. At present we are not aware of any study that 
has systematically compared the discrete-gamma model with a mixture on overall-rates and 
we believe that this is an interesting area of future research. 

A further potential application of our classification method is as a tool for identifying 
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the sites that are unable to undergo substitution. The presence of invariant sites is a well- 
documented cause of inconsistency in phylogeny reconstruction (e.g. Steel et al., 2000), and 
site classification could be employed to pinpoint the invariant sites that should be excluded 
from the alignment before inference. This idea was previously discussed in Huelsenbeck and 
Suchard (2007), and it would require to define a mixture that includes a strictly invariant 
class (i.e. a class with total tree length constrained to being zero). 

The MCMC methods discussed in this paper have been coded in a C program and source 
files are available upon request. 

A Inference on allocation variables 

Our MCMC sampler simulates a Markov chain of mixture parameters and a chain of alloca- 
tion variables from the joint posterior distribution of all model parameters. Once the chains 
have been produced and checked for convergence to stationarity, good mixing and lack of 
label-switching, they can be used to make reliable inferences about the posterior distribution. 
In this appendix we describe ways of summarising the chain of allocation variables. 

Consider a sample %,...,% of the allocation for site n, generated from an MCMC 
run of length M after burn-in. Variable Zn indicates the identity of the component to 
which site n is allocated at iteration i and it takes values in the set {1, . . . , k}. The sample 
Zn 1 ■ ■ ■ 1 Zn can be used to count the number of times that site n was allocated to component 
j throughout the run. This frequency count, divided by the total number of samples, M, 
gives the posterior classification probability of site n to component j. 

B Estimation of integrated autocorrelation times 

The integrated autocorrelation time, T{f), provides the means to compare different MCMC 
methods: to optimise the accuracy of estimation one could choose a method with the smallest 
possible t(/). However, when comparing two methods, we must not only consider the 
accuracy of estimation but also the computational cost. In our implementation, we use 
Geyer (1992) initial positive sequence estimator to estimate t(/): 
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K 

f(/) = -l + 2^f„ 

i=0 



where f , = p2i + P2J+1 is the sum of adjacent pairs of sample autocorrelations and pt is the 
autocorrelation at lag t. Here K is chosen to be the largest integer such that Fj > for 
i = 0,l,...,ir. 

The integrated autocorrelation time encodes the information about the correlation struc- 
ture of the chain; the greater the correlation between the samples the larger the t(/). It is 
in this sense that r(/) is a measure of the ability of the chain to move agilely around the 
support of the posterior distribution. A rapidly moving (or mixing) chain produces more 
reliable estimates than a slowly mixing one (for a fixed number of iterations), and the former 
is usually preferred over the latter as long as the computational cost of the rapidly-mixing 
chain is not prohibitive. 
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Supplementary Material: Alternating between BLNA and BLM, 
or using only one? 

One of the advantages of an alternating BLNA&BLM mechanism for proposing branch lengths 
is that one move can be tuned to generate modest steps while the other to produce bolder ones. 
In this section, we investigate the performance of single BLNA or BLM updates, compared to 
a sampler that uses both of them in an alternated manner. To do so, we produced a synthetic 

o : 

^ , DNA alignment of size 6 sequences x 2500 sites, generated with the software package Seq-Gen 



\o 



cr: 



(Rambaut and Crassly, 1997) under the GTR model of nucleotide substitution. The values used 
to generate this alignment are the following: the phylogenetic tree and branch lengths in Newick 



I formalu are: {{{Taxon2 : 0.16, TaxonA : 0.34) : 0.61, TaxonB : 0.2) : 0.53, {TaxonS : 0.48, Taxon6 : 

2.14) : 0.35, Taxonl : 0.05), where {Taxonl, . . . ,Taxon6} is the set of leaf-labels and the numbers 
^Jr[ correspond to the lengths of the branches; the vector of stationary probabilities is tt = [tta = . . . = 

2 ' ttg = 0.25) and the substitution rates are r = {rAc = 0.140, r^G = 0.3A0, tat = 0.090, rcc = 

0.008, rcT = 0.420, rcT = 0.002). 

In this exercise, we fixed r, vr and the tree to their true values. The target distribution is the joint 
posterior for branch lengths. We generated candidate branch lengths according to three different 
methods: (A) from a BLNA proposal; (B) from a BLM proposal; and (C) from an alternating 
BLNA&BLM scheme. In the alternating BLNA&BLM scheme, candidate branch lengths were 



> 

OO 

m 
m 
in 

C"^ ' generated from the BLNA proposal at even iterations and from the BLM proposal at odd ones. 

o : 

The justification for alternating moves and still converging to the target distribution is given by the 
fact that if chains P and R have the same stationary distribution, so does PR (see, for example, 

•T-H , 

S^ . Grimmett and Stirzaker, 2004). 

H ' 

Cd I We produced three replicates under each method, varying the tuning parameters of the BLM 

move {6 parameter) and BLNA (a parameter). The settings for these replicates are shown in Ta- 
bled! 



Table [2] reports the ergodic averages and estimated integrated autocorrelation times for each 
of the six exterior branch lengths ti,...,tQ, based on 15 000 samples after a burn-in period of 
5 000 iterations. We only report exterior branch lengths since interior branches are not uniquely 
labelled across different tree topologies. On the top-right of Tabled we have indicated the replicate 



The Newick format is widely used in phylogenetics for representing trees in computer-readable form. It makes use 
of the correspondence between trees and nested parentheses. This format is further described in Felsenstein, 2004. 



replicate 


6 


a 


(i) 


1.1 


0.08 


(ii) 


1.3 


0.06 


(iii) 


1.5 


0.04 



Table 1: Values for the 6 tuning parameter of the BLM move and the a tuning parameter of the 
BLNA proposal. These parameters were varied for three different replicates, (i), (ii) and (iii). 

number and the same order applies for all branches. Note the better performance of BLNA (small 
f) relative to BLM in estimating E(ti|x). The results suggest unsuitability of BLM for estimating 
short branches, which can be further investigated by calculating the expected value and variance 
of a branch length with respect to the BLM proposal. 

In a BLM move, a candidate length is generated as b' = bm, where b is the current length and 
m is a random variable with density function f{m) = (Xm)^^, 1/6 < m < 6. The expected value 
and variance of b' are given by 



Efib') = bEfim) 

-AH 



Varf{b') ^b^Varf{m) 



^' M^'-i 



1 

A2 



S- 



<b' <Sb 



(1) 



where A = 21og(5) and (5 > 1 is a tuning parameter. In the limit 6^0, the expected value and 
the variance of the candidate length approach Ej(6') —?■ and Varf{b') —?■ 0. This produces a 
phenomenon in which the chain is unable to move away from the zero neighbourhood, which we 
have dubbed 'zero-stickiness'. A phenomenon like this results in poor estimation performance, since 
the chain spends several iterations trapped at a small neighbourhood of the state space, producing 
MCMC samples that are highly correlated to one another. 

On the other hand, a candidate branch length is generated from a BLNA proposal as b' = b+au, 
where u ^ N{0, 1) and o" > is the tuning parameter. Under this proposal, the variance of b' does 
not depend on the current branch length and the step-size of the move is not influenced but by a. 





true length 


(A) BLNA 


(B) BLM 


(C) BLNA & BLM 




















average 


f 


average 


f 


average 


f 


h 


0.05 


0.070 


11.01 


0.071 


38.01 


0.069 


14.06 (i) 






0.069 


14.18 


0.066 


251.84 


0.070 


18.30 (ii) 






0.070 


13.36 


0.070 


24.41 


0.071 


13.63 (iii) 


t2 


0.16 


0.159 


10.88 


0.157 


9.80 


0.157 


11.76 






0.158 


17.28 


0.157 


26.73 


0.156 


19.65 






0.158 


9.69 


0.157 


9.73 


0.157 


8.83 


ts 


0.48 


0.469 


17.40 


0.468 


22.53 


0.469 


25.53 






0.467 


21.41 


0.465 


56.76 


0.470 


27.70 






0.470 


29.28 


0.470 


26.95 


0.469 


30.80 


u 


0.34 


0.310 


10.05 


0.313 


10.05 


0.312 


11.01 






0.311 


16.52 


0.313 


19.88 


0.312 


14.39 






0.311 


11.53 


0.312 


11.28 


0.312 


10.52 


ts 


0.20 


0.188 


9.50 


0.188 


11.61 


0.187 


11.08 






0.188 


10.65 


0.188 


35.64 


0.188 


17.04 






0.187 


10.53 


0.189 


9.87 


0.187 


12.20 


h 


2.14 


2.074 


36.17 


2.076 


17.21 


2.069 


21.56 






2.081 


25.97 


2.071 


18.12 


2.072 


20.39 






2.066 


76.32 


2.074 


19.89 


2.075 


21.73 



Table 2: The ergodic averages for exterior branch lengths and the estimated integrated autocor- 
relation time (f), for three different branch- length updating methods: (A) BLNA proposal; (B) 
BLM proposal; and (C) an alternating BLNA&BLM scheme. For each method, three replicates 
were performed, each replicate with different tuning parameters. The replicate number is indicated 
on the top right-hand-side of the table, and this same order applies for all branches. The results 
correspond to 15 000 samples after burn-in. All runs were initialised at the same starting point. 
The average execution time (across replicates) for the three methods were: (A) 3 300, (B) 3 320, 
and (C) 3 000; all measured in seconds. 



In estimating E,{tQ\x), BLNA performs poorly relative to BLM (see Table [2]). In other words, when 
the true branch length is long, BLM outperforms BLNA in all replicates. We believe that this 
might be due to the fact that the step-size of BLM depends on the current branch length whereas 
that of BLNA does not. A method that alternates between BLNA and BLM inherits the good 
properties of both proposals. The results for the combined BLNA&BLM, in Table [21 highlight the 
good estimation performance of such a strategy and justify our preference for alternating between 
BLNA and BLM when updating branch lengths. 
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